Abstract

Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches (Daumé III et al., 2009; Ross and Bagnell, 2010) provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no-regret algorithm in an online learning setting. We show that any such no-regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.

1 INTRODUCTION

Sequence prediction problems arise commonly in practice. For instance, most robotic systems must be able to predict/make a sequence of actions given a sequence of observations revealed to them over time. In complex robotic systems where standard control methods fail, we must often resort to learning a controller that can make such predictions. Imitation learning techniques, where expert demonstrations of good behavior are used to learn a controller, have proven very useful in practice and have led to state-of-the-art performance in a variety of applications (Schaal, 1999; Abbeel and Ng, 2004; Ratliff et al., 2006; Silver et al., 2008; Argall et al., 2009; Chernova and Veloso, 2009; Ross and Bagnell, 2010). A typical approach to imitation learning is to train a classifier or regressor to predict an expert's behavior given training data of the encountered observations (input) and actions (output) performed by the expert. However, since the learner's prediction affects future input observations/states during execution of the learned policy, this violates the crucial i.i.d. assumption made by most statistical learning approaches.

Ignoring this issue leads to poor performance both in theory and practice (Ross and Bagnell, 2010). In particular, a classifier that makes a mistake with probability $\epsilon$ under the distribution of states/observations encountered by the expert can make as many as $T^2\epsilon$ mistakes in expectation over $T$ steps under the distribution of states the classifier itself induces (Ross and Bagnell, 2010). Intuitively this is because as soon as the learner makes a mistake, it may encounter completely different observations than those under expert demonstration, leading to a compounding of errors.



Recent approaches (Ross and Bagnell, 2010) can guarantee an expected number of mistakes linear (or nearly so) in the task horizon $T$ and error $\epsilon$ by training over several iterations and allowing the learner to influence the input states where expert demonstration is provided (through execution of its own controls in the system). One approach (Ross and Bagnell, 2010) learns a non-stationary policy by training a different policy for each time step in sequence, starting from the first step. Unfortunately this is impractical when $T$ is large or ill-defined. Another approach called SMILe (Ross and Bagnell, 2010), similar to SEARN (Daumé III et al., 2009) and CPI (Kakade and Langford, 2002), trains a stationary stochastic policy (a finite mixture of policies) by adding a new policy to the mixture at each iteration of training. However this may be unsatisfactory for practical applications as some policies in the mixture are worse than others and the learned controller may be unstable.

We propose a new meta-algorithm for imitation learning which learns a stationary deterministic policy guaranteed to perform well under its induced distribution of states (number of mistakes/costs that grows linearly in $T$ and classification cost $\epsilon$). We take a reduction-based approach (Beygelzimer et al., 2005) that enables reusing existing supervised learning algorithms. Our approach is simple to implement, has no free parameters except the supervised learning algorithm sub-routine, and requires a number of iterations that scales nearly linearly with the effective horizon of the problem. It naturally handles continuous as well as discrete predictions. Our approach is closely related to no-regret online learning algorithms (Cesa-Bianchi et al., 2004; Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008) (in particular Follow-The-Leader) but better leverages the expert in our setting. Additionally, we show that any no-regret learner can be used in a particular fashion to learn a policy that achieves similar guarantees.

We begin by establishing our notation and setting, discuss related work, and then present the DAGGER (Dataset Aggregation) method. We analyze this approach using a no-regret and a reduction approach (Beygelzimer et al., 2005). Beyond the reduction analysis, we consider the sample complexity of our approach using online-to-batch (Cesa-Bianchi et al., 2004) techniques. We demonstrate that DAGGER is scalable and outperforms previous approaches in practice on two challenging imitation learning problems: 1) learning to steer a car in a 3D racing game (Super Tux Kart) and 2) learning to play Super Mario Bros., given input image features and corresponding actions by a human expert and near-optimal planner respectively. Following Daumé III et al. (2009) in treating structured prediction as a degenerate imitation learning problem, we apply DAGGER to the OCR (Taskar et al., 2003) benchmark prediction problem, achieving results competitive with the state-of-the-art (Taskar et al., 2003; Ratliff et al., 2007; Daumé III et al., 2009) using only single-pass, greedy prediction.



2 PRELIMINARIES 

We begin by introducing notation relevant to our setting. We denote by $\Pi$ the class of policies the learner is considering and $T$ the task horizon. For any policy $\pi$, we let $d_\pi^t$ denote the distribution of states at time $t$ if the learner executed policy $\pi$ from time step 1 to $t-1$. Furthermore, we denote $d_\pi = \frac{1}{T}\sum_{t=1}^{T} d_\pi^t$ the average distribution of states if we follow policy $\pi$ for $T$ steps. Given a state $s$, we denote $C(s,a)$ the expected immediate cost of performing action $a$ in state $s$ for the task we are considering and denote $C_\pi(s) = \mathbb{E}_{a \sim \pi(s)}[C(s,a)]$ the expected immediate cost of $\pi$ in $s$. We assume $C$ is bounded in $[0,1]$. The total cost of executing policy $\pi$ for $T$ steps (i.e., the cost-to-go) is denoted $J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^t}[C_\pi(s)] = T\,\mathbb{E}_{s \sim d_\pi}[C_\pi(s)]$.
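The notation above can be made concrete on a small tabular system. The following is a minimal sketch, assuming a hypothetical 2-state, 2-action system with known transition probabilities (the paper itself assumes the dynamics are unknown); the names `P`, `C`, `state_distributions` and `total_cost` are illustrative only. It computes the per-step distributions $d_\pi^t$, the average distribution $d_\pi$, and checks that the two expressions for $J(\pi)$ agree.

```python
def state_distributions(P, policy, d1, T):
    """Return [d^1, ..., d^T], where d^t is the state distribution at step t.

    P[s][a] is the list of next-state probabilities, policy[s] is the
    deterministic action taken in state s, and d1 is the initial distribution.
    """
    dists = [list(d1)]
    for _ in range(T - 1):
        prev = dists[-1]
        nxt = [0.0] * len(prev)
        for s, p_s in enumerate(prev):
            for s2, p in enumerate(P[s][policy[s]]):
                nxt[s2] += p_s * p
        dists.append(nxt)
    return dists


def average_distribution(dists):
    """d_pi = (1/T) * sum_t d^t."""
    T = len(dists)
    return [sum(d[s] for d in dists) / T for s in range(len(dists[0]))]


def total_cost(dists, C, policy):
    """J(pi) = sum_t E_{s ~ d^t}[C_pi(s)] for a deterministic policy."""
    return sum(d[s] * C[s][policy[s]] for d in dists for s in range(len(d)))


# Hypothetical system: state 0 = "on track", state 1 = "off track".
P = [
    [[0.9, 0.1], [0.5, 0.5]],   # from state 0 under action 0, action 1
    [[0.3, 0.7], [0.1, 0.9]],   # from state 1 under action 0, action 1
]
C = [[0.0, 0.2], [1.0, 1.0]]    # C[s][a], bounded in [0, 1]
policy = [0, 0]                 # always take action 0
T = 5

dists = state_distributions(P, policy, d1=[1.0, 0.0], T=T)
d_pi = average_distribution(dists)
J = total_cost(dists, C, policy)
J_from_average = T * sum(d_pi[s] * C[s][policy[s]] for s in range(2))
print(d_pi, J, J_from_average)
```

Both ways of computing the cost-to-go coincide because $d_\pi$ is just the time average of the $d_\pi^t$.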



In imitation learning, we may not necessarily know or observe true costs $C(s,a)$ for the particular task. Instead, we observe expert demonstrations and seek to bound $J(\pi)$ for any cost function $C$ based on how well $\pi$ mimics the expert's policy $\pi^*$. Denote $\ell$ the observed surrogate loss function we minimize instead of $C$. For instance $\ell(s,\pi)$ may be the expected 0-1 loss of $\pi$ with respect to $\pi^*$ in state $s$, or a squared/hinge loss of $\pi$ with respect to $\pi^*$ in $s$. Importantly, in many instances, $C$ and $\ell$ may be the same function, for instance if we are interested in optimizing the learner's ability to predict the actions chosen by an expert.

Our goal is to find a policy $\hat{\pi}$ which minimizes the observed surrogate loss under its induced distribution of states, i.e.:

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_\pi}[\ell(s,\pi)] \qquad (1)$$



As system dynamics are assumed both unknown and complex, we cannot compute $d_\pi$ and can only sample it by executing $\pi$ in the system. Hence this is a non-i.i.d. supervised learning problem due to the dependence of the input distribution on the policy $\pi$ itself. The interaction between policy and the resulting distribution makes optimization difficult as it results in a non-convex objective even if the loss $\ell(s,\cdot)$ is convex in $\pi$ for all states $s$. We now briefly review previous approaches and their guarantees.

2.1 Supervised Approach to Imitation 

The traditional approach to imitation learning ignores the change in distribution and simply trains a policy $\pi$ that performs well under the distribution of states encountered by the expert $d_{\pi^*}$. This can be achieved using any standard supervised learning algorithm. It finds the policy $\hat{\pi}_{sup}$:

$$\hat{\pi}_{sup} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s,\pi)] \qquad (2)$$

Assuming $\ell(s,\pi)$ is the 0-1 loss (or upper bound on the 0-1 loss) implies the following performance guarantee with respect to any task cost function $C$ bounded in $[0,1]$:



Theorem 2.1. (Ross and Bagnell, 2010) Let $\mathbb{E}_{s \sim d_{\pi^*}}[\ell(s,\pi)] = \epsilon$, then $J(\pi) \le J(\pi^*) + T^2\epsilon$.

Proof. Follows from result in Ross and Bagnell (2010) since $\epsilon$ is an upper bound on the 0-1 loss of $\pi$ in $d_{\pi^*}$. □

Note that this bound is tight, i.e. there exist problems such that a policy $\pi$ with $\epsilon$ 0-1 loss on $d_{\pi^*}$ can incur extra cost that grows quadratically in $T$. Kääriäinen (2006) demonstrated this in a sequence prediction setting¹ and Ross and Bagnell (2010) provided an imitation learning example where $J(\hat{\pi}_{sup}) = (1 - \epsilon T)J(\pi^*) + T^2\epsilon$. Hence the traditional supervised learning approach has poor performance guarantees due to the quadratic growth in $T$. Instead we would prefer approaches that can guarantee growth linear or near-linear in $T$ and $\epsilon$. The following two approaches from Ross and Bagnell (2010) achieve this on some classes of imitation learning problems, including all those where surrogate loss $\ell$ upper bounds $C$.

¹ In their example, an error rate of $\epsilon > 0$ when trained to predict the next output in sequence with the previous correct output as input can lead to an expected number of mistakes of $\frac{T}{2} - \frac{1 - (1 - 2\epsilon)^{T+1}}{4\epsilon} + \frac{1}{2}$ over sequences of length $T$ at test time. This is bounded by $T^2\epsilon$ and behaves as $\Theta(T^2\epsilon)$ for small $\epsilon$.

2.2 Forward Training 



The forward training algorithm introduced by Ross and Bagnell (2010) trains a non-stationary policy (one policy $\pi_t$ for each time step $t$) iteratively over $T$ iterations, where at iteration $t$, $\pi_t$ is trained to mimic $\pi^*$ on the distribution of states at time $t$ induced by the previously trained policies $\pi_1, \pi_2, \dots, \pi_{t-1}$. By doing so, $\pi_t$ is trained on the actual distribution of states it will encounter during execution of the learned policy. Hence the forward algorithm guarantees that the expected loss under the distribution of states induced by the learned policy matches the average loss during training, and hence improves performance.

We here provide a theorem slightly more general than the one provided by Ross and Bagnell (2010) that applies to any policy $\pi$ that can guarantee $\epsilon$ surrogate loss under its own distribution of states. This will be useful to bound the performance of our new approach presented in Section 3.

Let $Q_t^{\pi'}(s,\pi)$ denote the $t$-step cost of executing $\pi$ in initial state $s$ and then following policy $\pi'$, and assume $\ell(s,\pi)$ is the 0-1 loss (or an upper bound on the 0-1 loss). Then we have the following performance guarantee with respect to any task cost function $C$ bounded in $[0,1]$:

Theorem 2.2. Let $\pi$ be such that $\mathbb{E}_{s \sim d_\pi}[\ell(s,\pi)] = \epsilon$, and $Q_{T-t+1}^{\pi^*}(s,a) - Q_{T-t+1}^{\pi^*}(s,\pi^*) \le u$ for all actions $a$, $t \in \{1,2,\dots,T\}$, $d_\pi^t(s) > 0$, then $J(\pi) \le J(\pi^*) + uT\epsilon$.



Proof. We here follow a similar proof to Ross and Bagnell (2010). Given our policy $\pi$, consider the policy $\pi_{1:t}$, which executes $\pi$ in the first $t$ steps and then executes the expert $\pi^*$. Then

$$J(\pi) = J(\pi^*) + \sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^t}\left[Q_{T-t+1}^{\pi^*}(s,\pi) - Q_{T-t+1}^{\pi^*}(s,\pi^*)\right]$$
$$\le J(\pi^*) + u\sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^t}[\ell(s,\pi)]$$
$$= J(\pi^*) + uT\epsilon$$

The inequality follows from the fact that $\ell(s,\pi)$ upper bounds the 0-1 loss, and hence the probability $\pi$ and $\pi^*$ pick different actions in $s$; when they pick different actions, the increase in cost-to-go is at most $u$. □

In the worst case, $u$ could be $O(T)$ and the forward algorithm wouldn't provide any improvement over the traditional supervised learning approach. However, in many cases $u$ is $O(1)$ or sub-linear in $T$ and the forward algorithm leads to improved performance. For instance if $C$ is the 0-1 loss with respect to the expert, then $u \le 1$. Additionally if $\pi^*$ is able to recover from mistakes made by $\pi$, in the sense that within a few steps, $\pi^*$ is back in a distribution of states that is close to what $\pi^*$ would be in if $\pi^*$ had been executed initially instead of $\pi$, then $u$ will be $O(1)$.² A drawback of the forward algorithm is that it is impractical when $T$ is large (or undefined) as we must train $T$ different policies sequentially and cannot stop the algorithm before we complete all $T$ iterations. Hence it cannot be applied to most real-world applications.

2.3 Stochastic Mixing Iterative Learning 



SMILe, proposed by Ross and Bagnell (2010), alleviates this problem and can be applied in practice when $T$ is large or undefined by adopting an approach similar to SEARN (Daumé III et al., 2009) where a stochastic stationary policy is trained over several iterations. Initially SMILe starts with a policy $\pi_0$ which always queries and executes the expert's action choice. At iteration $n$, a policy $\hat{\pi}_n$ is trained to mimic the expert under the distribution of trajectories $\pi_{n-1}$ induces and then updates $\pi_n = \pi_{n-1} + \alpha(1-\alpha)^{n-1}(\hat{\pi}_n - \pi_0)$. This update is interpreted as adding probability $\alpha(1-\alpha)^{n-1}$ to executing policy $\hat{\pi}_n$ at any step and removing probability $\alpha(1-\alpha)^{n-1}$ of executing the queried expert's action. At iteration $n$, $\pi_n$ is a mixture of $n$ policies and the probability of using the queried expert's action is $(1-\alpha)^n$. We can stop the algorithm at any iteration $N$ by returning the re-normalized policy $\tilde{\pi}_N = \frac{\pi_N - (1-\alpha)^N \pi_0}{1 - (1-\alpha)^N}$ which doesn't query the expert anymore. Ross and Bagnell (2010) showed that choosing $\alpha$ in $O(\frac{1}{T^2})$ and $N$ in $O(T^2 \log T)$ guarantees near-linear regret in $T$ and $\epsilon$ for some class of problems.
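The bookkeeping of the SMILe mixture can be checked numerically. The following is a minimal sketch of the mixture weights only (it does not train any policy); the function names `smile_weights` and `renormalized_weights` are hypothetical and not taken from the paper.

```python
def smile_weights(alpha, N):
    """Mixture weights after N SMILe iterations.

    Returns (expert_prob, learned), where expert_prob = (1 - alpha)**N is the
    probability of executing the queried expert action and learned[n - 1] =
    alpha * (1 - alpha)**(n - 1) is the probability of executing the policy
    trained at iteration n.
    """
    learned = [alpha * (1 - alpha) ** (n - 1) for n in range(1, N + 1)]
    expert_prob = (1 - alpha) ** N
    return expert_prob, learned


def renormalized_weights(alpha, N):
    """Weights of the returned policy, which no longer queries the expert."""
    expert_prob, learned = smile_weights(alpha, N)
    return [w / (1 - expert_prob) for w in learned]


expert_prob, learned = smile_weights(alpha=0.1, N=20)
final = renormalized_weights(alpha=0.1, N=20)
print(expert_prob, sum(learned), sum(final))
```

The expert probability and the learned-policy weights sum to one at every iteration, and removing the expert's share and dividing by $1 - (1-\alpha)^N$ yields a proper mixture over the $N$ trained policies.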

3 DATASET AGGREGATION 

We now present DAGGER (Dataset Aggregation), an iterative algorithm that trains a deterministic policy that achieves good performance guarantees under its induced distribution of states.

In its simplest form, the algorithm proceeds as follows. At the first iteration, it uses the expert's policy to gather a dataset of trajectories $\mathcal{D}$ and train a policy $\hat{\pi}_2$ that best mimics the expert on those trajectories. Then at iteration $n$, it uses $\hat{\pi}_n$ to collect more trajectories and adds those trajectories to the dataset $\mathcal{D}$. The next policy $\hat{\pi}_{n+1}$ is the policy that best mimics the expert on the whole dataset $\mathcal{D}$.



² This is the case for instance in Markov Decision Processes (MDPs) when the Markov Chain defined by the system dynamics and policy $\pi^*$ is rapidly mixing. In particular, if it is $\alpha$-mixing with exponential decay rate $\delta$ then $u$ is $O(\frac{1}{1 - \exp(-\delta)})$.
Initialize $\mathcal{D} \leftarrow \emptyset$.
Initialize $\hat{\pi}_1$ to any policy in $\Pi$.
for $i = 1$ to $N$ do
    Let $\pi_i = \beta_i \pi^* + (1 - \beta_i)\hat{\pi}_i$.
    Sample $T$-step trajectories using $\pi_i$.
    Get dataset $\mathcal{D}_i = \{(s, \pi^*(s))\}$ of visited states by $\pi_i$ and actions given by expert.
    Aggregate datasets: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_i$.
    Train classifier $\hat{\pi}_{i+1}$ on $\mathcal{D}$.
end for
Return best $\hat{\pi}_i$ on validation.

Algorithm 3.1: DAGGER Algorithm.



In other words, DAGGER proceeds by collecting a dataset at each iteration under the current policy and trains the next policy under the aggregate of all collected datasets. The intuition behind this algorithm is that over the iterations, we are building up the set of inputs that the learned policy is likely to encounter during its execution based on previous experience (training iterations). This algorithm can be interpreted as a Follow-The-Leader algorithm in that at iteration $n$ we pick the best policy $\hat{\pi}_{n+1}$ in hindsight, i.e. under all trajectories seen so far over the iterations.

To better leverage the presence of the expert in our imitation learning setting, we optionally allow the algorithm to use a modified policy $\pi_i = \beta_i \pi^* + (1 - \beta_i)\hat{\pi}_i$ at iteration $i$ that queries the expert to choose controls a fraction of the time while collecting the next dataset. This is often desirable in practice as the first few policies, with relatively few datapoints, may make many more mistakes and visit states that are irrelevant as the policy improves.
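The data-collection and aggregation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the environment (`step`), the `expert`, and the lookup-table learner (`train`) are hypothetical stand-ins, one trajectory is sampled per iteration, the expert is mixed in per step with probability $\beta_i$, and the final selection of the best policy on validation data is omitted.

```python
import random
from collections import Counter, defaultdict


def dagger(expert, train, step, start, initial_policy, N, T, beta, rng):
    """Dataset Aggregation loop; returns the trained policies and the dataset."""
    dataset = []                       # aggregated dataset D
    learner = initial_policy           # pi_hat_1
    policies = []
    for i in range(1, N + 1):
        s = start
        for _ in range(T):
            # Record the visited state with the expert's action as its label.
            dataset.append((s, expert(s)))
            # pi_i executes the expert with probability beta_i, else pi_hat_i.
            a = expert(s) if rng.random() < beta(i) else learner(s)
            s = step(s, a, rng)
        learner = train(dataset)       # pi_hat_{i+1}: fit on all data so far
        policies.append(learner)
    return policies, dataset


# Hypothetical toy task: keep an integer position in [-5, 5] close to 0.
def expert(s):
    return -1 if s > 0 else 1          # steer back toward 0


def step(s, a, rng):
    return max(-5, min(5, s + a + rng.choice([-1, 0, 1])))


def train(dataset):
    """Lookup-table classifier: majority expert label per visited state."""
    votes = defaultdict(Counter)
    for s, a in dataset:
        votes[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 1)   # default action for unseen states


rng = random.Random(0)
policies, dataset = dagger(
    expert, train, step, start=0, initial_policy=lambda s: 1,
    N=10, T=50, beta=lambda i: 1.0 if i == 1 else 0.0, rng=rng)
final = policies[-1]
print(len(dataset), all(final(s) == expert(s) for s, _ in dataset))  # prints: 500 True
```

With the schedule used here the expert drives the system only during the first iteration; afterwards the learner's own mistakes determine which states get labeled, which is how the aggregated dataset comes to cover the states the learned policy actually visits.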

We will typically use $\beta_1 = 1$ so that we do not have to specify an initial policy $\hat{\pi}_1$ before getting data from the expert's behavior. Then we could choose $\beta_i = p^{i-1}$ to have a probability of using the expert that decays exponentially as in SMILe and SEARN. We show below the only requirement is that $\{\beta_i\}$ be a sequence such that $\bar{\beta}_N = \frac{1}{N}\sum_{i=1}^{N} \beta_i \to 0$ as $N \to \infty$. The simple, parameter-free version of the algorithm described above is the special case $\beta_i = I(i = 1)$ for $I$ the indicator function, which often performs best in practice (see Section 5). The general DAGGER algorithm is detailed in Algorithm 3.1. The main result of our analysis in the next section is the following guarantee for DAGGER. Let $\pi_{1:N}$ denote the sequence of policies $\pi_1, \pi_2, \dots, \pi_N$. Assume $\ell$ is strongly convex and bounded over $\Pi$. Suppose $\beta_i \le (1-\alpha)^{i-1}$ for all $i$ for some constant $\alpha$ independent of $T$. Let $\epsilon_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s,\pi)]$ be the true loss of the best policy in hindsight. Then the following holds in the infinite sample case (infinite number of sample trajectories at each iteration):

Theorem 3.1. For DAGGER, if $N$ is $\tilde{O}(T)$ there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $\mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})] \le \epsilon_N + O(1/T)$.

In particular, this holds for the policy $\hat{\pi} = \arg\min_{\pi \in \hat{\pi}_{1:N}} \mathbb{E}_{s \sim d_\pi}[\ell(s,\pi)]$.³ If the task cost function $C$ corresponds to (or is upper bounded by) the surrogate loss $\ell$ then this bound tells us directly that $J(\hat{\pi}) \le T\epsilon_N + O(1)$. For arbitrary task cost function $C$, then if $\ell$ is an upper bound on the 0-1 loss with respect to $\pi^*$, combining this result with Theorem 2.2 yields that:

Theorem 3.2. For DAGGER, if $N$ is $\tilde{O}(uT)$ there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $J(\hat{\pi}) \le J(\pi^*) + uT\epsilon_N + O(1)$.

Finite Sample Results: In the finite sample case, suppose we sample $m$ trajectories with $\pi_i$ at each iteration $i$, and denote this dataset $\mathcal{D}_i$. Let $\hat{\epsilon}_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}[\ell(s,\pi)]$ be the training loss of the best policy on the sampled trajectories, then using Azuma-Hoeffding's inequality leads to the following guarantee:

Theorem 3.3. For DAGGER, if $N$ is $O(T^2 \log(1/\delta))$ and $m$ is $O(1)$ then with probability at least $1 - \delta$ there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $\mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})] \le \hat{\epsilon}_N + O(1/T)$.

A more refined analysis taking advantage of the strong convexity of the loss function (Kakade and Tewari, 2009) may lead to tighter generalization bounds that require $N$ only of order $\tilde{O}(T \log(1/\delta))$. Similarly:

Theorem 3.4. For DAGGER, if $N$ is $O(u^2 T^2 \log(1/\delta))$ and $m$ is $O(1)$ then with probability at least $1 - \delta$ there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $J(\hat{\pi}) \le J(\pi^*) + uT\hat{\epsilon}_N + O(1)$.



4 THEORETICAL ANALYSIS 

The theoretical analysis of DAGGER only relies on the no-regret property of the underlying Follow-The-Leader algorithm on strongly convex losses (Kakade and Tewari, 2009) which picks the sequence of policies $\hat{\pi}_{1:N}$. Hence the presented results also hold for any other no-regret online learning algorithm we would apply to our imitation learning setting. In particular, we can consider the results here a reduction of imitation learning to no-regret online learning where we treat mini-batches of trajectories under a single policy as a single online-learning example. We first briefly review concepts of online learning and no regret that will be used for this analysis.

4.1 Online Learning 

In online learning, an algorithm must provide a policy $\pi_n$ at iteration $n$ which incurs a loss $\ell_n(\pi_n)$. After observing this loss, the algorithm can provide a different policy $\pi_{n+1}$ for the next iteration which will incur loss $\ell_{n+1}(\pi_{n+1})$. The loss functions $\ell_{n+1}$ may vary in an unknown or even adversarial fashion over time. A no-regret algorithm is an algorithm that produces a sequence of policies $\pi_1, \pi_2, \dots, \pi_N$ such that the average regret with respect to the best policy in hindsight goes to 0 as $N$ goes to $\infty$:

$$\frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi_i) - \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi) \le \gamma_N \qquad (3)$$

for $\lim_{N \to \infty} \gamma_N = 0$. Many no-regret algorithms guarantee that $\gamma_N$ is $\tilde{O}(\frac{1}{N})$ (e.g. when $\ell$ is strongly convex) (Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008; Kakade and Tewari, 2009).

³ It is not necessary to find the best policy in the sequence that minimizes the loss under its distribution; the same guarantee holds for the policy which uniformly randomly picks one policy in the sequence $\hat{\pi}_{1:N}$ and executes that policy for $T$ steps.
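To illustrate the no-regret property in equation (3), here is a minimal sketch of Follow-The-Leader on a one-dimensional, strongly convex toy problem. The quadratic losses $\ell_i(x) = (x - z_i)^2$ and the alternating target sequence are assumptions chosen for illustration; they are not the imitation learning losses used in the paper.

```python
def ftl_average_regret(targets, x1=0.0):
    """Average regret of Follow-The-Leader on losses l_i(x) = (x - z_i)**2."""
    x, running_sum, incurred = x1, 0.0, 0.0
    for i, z in enumerate(targets, start=1):
        incurred += (x - z) ** 2       # loss of the point played at round i
        running_sum += z
        x = running_sum / i            # leader: minimizer of all losses seen so far
    best = running_sum / len(targets)  # best fixed point in hindsight
    best_loss = sum((best - z) ** 2 for z in targets)
    return (incurred - best_loss) / len(targets)


# Alternating targets 1, 0, 1, 0, ... keep pulling the leader back and forth.
regrets = {N: ftl_average_regret([i % 2 for i in range(1, N + 1)])
           for N in (10, 100, 1000)}
print(regrets)
```

On this sequence the total regret grows only logarithmically in $N$, so the average regret $\gamma_N$ shrinks roughly like $\log N / N$, in line with the $\tilde{O}(1/N)$ rate quoted for strongly convex losses.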

4.2 No Regret Algorithms Guarantees 

Now we show that no-regret algorithms can be used to find a policy which has good performance guarantees under its own distribution of states in our imitation learning setting. To do so, we must choose the loss functions to be the loss under the distribution of states of the current policy chosen by the online algorithm: $\ell_i(\pi) = \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s,\pi)]$.

For our analysis of DAGGER, we need to bound the total variation distance between the distribution of states encountered by $\hat{\pi}_i$ and $\pi_i$, which continues to call the expert. The following lemma is useful:

Lemma 4.1. $\|d_{\pi_i} - d_{\hat{\pi}_i}\|_1 \le 2T\beta_i$.

Proof. Let $d$ be the distribution of states over $T$ steps conditioned on $\pi_i$ picking $\pi^*$ at least once over $T$ steps. Since $\pi_i$ always executes $\hat{\pi}_i$ over $T$ steps with probability $(1 - \beta_i)^T$ we have $d_{\pi_i} = (1 - \beta_i)^T d_{\hat{\pi}_i} + (1 - (1 - \beta_i)^T) d$. Thus

$$\|d_{\pi_i} - d_{\hat{\pi}_i}\|_1 = (1 - (1 - \beta_i)^T)\,\|d - d_{\hat{\pi}_i}\|_1$$
$$\le 2(1 - (1 - \beta_i)^T)$$
$$\le 2T\beta_i$$

The last inequality follows from the fact that $(1 - \beta)^T \ge 1 - \beta T$ for any $\beta \in [0,1]$. □
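The two bounds appearing in the proof of Lemma 4.1 can be compared numerically. This is a small sketch with a hypothetical helper name; it only evaluates the closed-form expressions and does not simulate any policy.

```python
def mixture_distance_bounds(beta, T):
    """Bounds on ||d_{pi_i} - d_{pi_hat_i}||_1 from the proof of Lemma 4.1."""
    tight = 2.0 * (1.0 - (1.0 - beta) ** T)   # 2 * P(expert executed at least once)
    linear = 2.0 * T * beta                   # bound stated in the lemma
    return tight, min(linear, 2.0)            # an L1 distance never exceeds 2


for beta in (0.001, 0.01, 0.1):
    print(beta, mixture_distance_bounds(beta, T=100))
```

For $\beta_i$ well below $1/T$ the two expressions nearly coincide, while for larger $\beta_i$ the linear bound saturates at the trivial value 2.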

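This is only better than the trivial bound $\|d_{\pi_i} - d_{\hat{\pi}_i}\|_1 \le 2$ for $\beta_i \le \frac{1}{T}$. Assume $\beta_i$ is non-increasing and define $n_\beta$ the largest $n \le N$ such that $\beta_n > \frac{1}{T}$. Let $\epsilon_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s,\pi)]$ be the loss of the best policy in hindsight after $N$ iterations and let $\ell_{\max}$ be an upper bound on the loss, i.e. $\ell_i(s,\hat{\pi}_i) \le \ell_{\max}$ for all policies $\hat{\pi}_i$ and states $s$ such that $d_{\hat{\pi}_i}(s) > 0$. We have the following:

Theorem 4.1. For DAGGER, there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $\mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})] \le \epsilon_N + \gamma_N + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$, for $\gamma_N$ the average regret of $\hat{\pi}_{1:N}$.

Proof. The last lemma implies $\mathbb{E}_{s \sim d_{\hat{\pi}_i}}(\ell(s,\hat{\pi}_i)) \le \mathbb{E}_{s \sim d_{\pi_i}}(\ell(s,\hat{\pi}_i)) + 2\ell_{\max}\min(1, T\beta_i)$. Then:

$$\min_{\hat{\pi} \in \hat{\pi}_{1:N}} \mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})]$$
$$\le \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\hat{\pi}_i}}(\ell(s,\hat{\pi}_i))$$
$$\le \frac{1}{N}\sum_{i=1}^{N} \left[\mathbb{E}_{s \sim d_{\pi_i}}(\ell(s,\hat{\pi}_i)) + 2\ell_{\max}\min(1, T\beta_i)\right]$$
$$\le \gamma_N + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right] + \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi)$$
$$= \gamma_N + \epsilon_N + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$$

□

Under an error reduction assumption that for any input distribution, there is some policy $\pi \in \Pi$ that achieves surrogate loss of $\epsilon$, this implies we are guaranteed to find a policy $\hat{\pi}$ which achieves $\epsilon$ surrogate loss under its own state distribution in the limit, provided $\bar{\beta}_N \to 0$. For instance, if we choose $\beta_i$ to be of the form $(1-\alpha)^{i-1}$, then $\frac{1}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right] \le \frac{1}{N\alpha}[\log T + 1]$ and this extra penalty becomes negligible for $N$ as $\tilde{O}(T)$. As we need at least $\tilde{O}(T)$ iterations to make $\gamma_N$ negligible, the number of iterations required by DAGGER is similar to that required by any no-regret algorithm. Note that this is not as strong as the general error or regret reductions considered in (Beygelzimer et al., 2005; Ross and Bagnell, 2010; Daumé III et al., 2009) which require only classification: we require a no-regret method or strongly convex surrogate loss function, a stronger (albeit common) assumption.

Finite Sample Case: The previous results hold if the online learning algorithm observes the infinite sample loss, i.e. the loss on the true distribution of trajectories induced by the current policy $\pi_i$. In practice however the algorithm would only observe its loss on a small sample of trajectories at each iteration. We wish to bound the true loss under its own distribution of the best policy in the sequence as a function of the regret on the finite sample of trajectories.

At each iteration $i$, we assume the algorithm samples $m$ trajectories using $\pi_i$ and then observes the loss $\ell_i(\pi) = \mathbb{E}_{s \sim \mathcal{D}_i}(\ell(s,\pi))$, for $\mathcal{D}_i$ the dataset of those $m$ trajectories. The online learner guarantees $\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}(\ell(s,\hat{\pi}_i)) - \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}(\ell(s,\pi)) \le \gamma_N$. Let $\hat{\epsilon}_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}[\ell(s,\pi)]$ be the training loss of the best policy in hindsight. Following a similar analysis to Cesa-Bianchi et al. (2004), we obtain:

Theorem 4.2. For DAGGER, with probability at least $1 - \delta$, there exists a policy $\hat{\pi} \in \hat{\pi}_{1:N}$ s.t. $\mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})] \le \hat{\epsilon}_N + \gamma_N + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right] + \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{mN}}$, for $\gamma_N$ the average regret of $\hat{\pi}_{1:N}$.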



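Proof. Let $Y_{ij}$ be the difference between the expected per step loss of $\hat{\pi}_i$ under state distribution $d_{\pi_i}$ and the average per step loss of $\hat{\pi}_i$ under the $j$-th sample trajectory with $\pi_i$ at iteration $i$. The random variables $Y_{ij}$ over all $i \in \{1,2,\dots,N\}$ and $j \in \{1,2,\dots,m\}$ are all zero mean, bounded in $[-\ell_{\max}, \ell_{\max}]$ and form a martingale (considering the order $Y_{11}, Y_{12}, \dots, Y_{1m}, Y_{21}, \dots, Y_{Nm}$). By Azuma-Hoeffding's inequality $\frac{1}{mN}\sum_{i=1}^{N}\sum_{j=1}^{m} Y_{ij} \le \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{mN}}$ with probability at least $1 - \delta$. Hence, we obtain that with probability at least $1 - \delta$:

$$\min_{\hat{\pi} \in \hat{\pi}_{1:N}} \mathbb{E}_{s \sim d_{\hat{\pi}}}[\ell(s,\hat{\pi})]$$
$$\le \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\hat{\pi}_i}}[\ell(s,\hat{\pi}_i)]$$
$$\le \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s,\hat{\pi}_i)] + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$$
$$= \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}[\ell(s,\hat{\pi}_i)] + \frac{1}{mN}\sum_{i=1}^{N}\sum_{j=1}^{m} Y_{ij} + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$$
$$\le \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim \mathcal{D}_i}[\ell(s,\hat{\pi}_i)] + \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{mN}} + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$$
$$\le \hat{\epsilon}_N + \gamma_N + \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{mN}} + \frac{2\ell_{\max}}{N}\left[n_\beta + T\sum_{i=n_\beta+1}^{N} \beta_i\right]$$

□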

The use of Azuma-Hoeffding's inequality suggests we need $Nm$ in $O(T^2 \log(1/\delta))$ for the generalization error to be $O(1/T)$ and negligible over $T$ steps. Leveraging the strong convexity of $\ell$ as in (Kakade and Tewari, 2009) may lead to a tighter bound requiring only $O(T \log(T/\delta))$ trajectories.

5 EXPERIMENTS

To demonstrate the efficacy and scalability of DAGGER, we apply it to two challenging imitation learning problems and a sequence labeling task (handwriting recognition).

5.1 Super Tux Kart

Super Tux Kart is a 3D racing game similar to the popular Mario Kart. Our goal is to train the computer to steer the kart moving at fixed speed on a particular race track, based on the current game image features as input (see Figure 1). A human expert is used to provide demonstrations of the correct steering (analog joystick value in [-1,1]) for each of the observed game images. For all methods, we use a linear controller as the base learner which updates the steering at 5Hz based on the vector of image features.⁴

Figure 1: Image from Super Tux Kart's Star Track.

⁴ Features $x$: LAB color values of each pixel in a 25x19 resized image of the 800x600 image; output steering: $\hat{y} = w^T x + b$ where $w$, $b$ minimizes ridge regression objective: $L(w,b) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i + b - y_i)^2 + \frac{\lambda}{2} w^T w$, for regularizer $\lambda = 10^{-3}$.

We compare performance on a race track called Star Track. As this track floats in space, the kart can fall off the track at any point (the kart is repositioned at the center of the track when this occurs). We measure performance in terms of the average number of falls per lap. For SMILe and DAGGER, we used 1 lap of training per iteration (about 1000 data points) and ran both methods for 20 iterations. For SMILe we choose parameter $\alpha = 0.1$ as in Ross and Bagnell (2010), and for DAGGER the parameter $\beta_i = I(i = 1)$ for $I$ the indicator function. Figure 2 shows 95% confidence intervals on the average falls per lap of each method after 1, 5, 10, 15 and 20 iterations as a function of the total number of training data collected. We first observe that with the baseline supervised approach, where training always occurs under the expert's trajectories, performance does not improve as more data is collected. This is because most of the training laps are all very similar and do not help the learner to learn how to recover from mistakes it makes. With SMILe we obtain some improvements but the policy after 20 iterations still falls off the track about twice per lap on average. This is in part due to the stochasticity of the policy which sometimes makes bad choices of actions. For DAGGER, we were able to obtain a policy that never falls off the track after 15 iterations of training. Even after 5 iterations, the policy we obtain almost never falls off the track and significantly outperforms both SMILe and the baseline supervised approach. Furthermore, the policy obtained by DAGGER is smoother and looks qualitatively better than the policy obtained with SMILe. A video available on YouTube (Ross, 2010a) shows a qualitative comparison of the behavior obtained with each method.

Figure 2: Average falls/lap as a function of training data.
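The ridge regression objective used for the linear steering controller has a closed-form minimizer. The following is a simplified sketch for a single scalar feature (the controller in the experiments uses a vector of image features); the function name `ridge_1d` and the toy data are hypothetical.

```python
def ridge_1d(xs, ys, lam=1e-3):
    """Minimize (1/n) * sum((w*x + b - y)**2) + (lam/2) * w**2 for scalar x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    var = sum((x - x_mean) ** 2 for x in xs) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    w = cov / (var + lam / 2.0)      # stationarity in w after eliminating b
    b = y_mean - w * x_mean          # the intercept is not regularized
    return w, b


# Toy data generated from steering = 0.5 * feature - 0.1 (no noise).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [0.5 * x - 0.1 for x in xs]
w, b = ridge_1d(xs, ys)
w0, b0 = ridge_1d(xs, ys, lam=0.0)
print(w, b, w0, b0)
```

Setting the derivative with respect to $b$ to zero gives $b = \bar{y} - w\bar{x}$, and substituting into the derivative with respect to $w$ gives $w = \mathrm{cov}(x,y) / (\mathrm{var}(x) + \lambda/2)$, so the regularizer shrinks the slope slightly while leaving the intercept free.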

5.2 Super Mario Bros. 

Super Mario Bros. is a platform video game where the character, Mario, must move across each stage by avoiding being hit by enemies and falling into gaps, and before running out of time. We used the simulator from a recent Mario Bros. AI competition (Togelius and Karakovskiy, 2009) which can randomly generate stages of varying difficulty (more difficult gaps and types of enemies). Our goal is to train the computer to play this game based on the current game image features as input (see Figure 3). Our expert in this scenario is a near-optimal planning algorithm that has full access to the game's internal state and can simulate exactly the consequence of future actions. An action consists of 4 binary variables indicating which subset of buttons we should press in {left, right, jump, speed}. For all methods, we use 4 independent linear SVMs as the base learner which update the 4 binary actions at 5Hz based on the vector of image features.⁵

Figure 3: Captured image from Super Mario Bros.

⁵ For the input features $x$: each image is discretized in a grid of 22x22 cells centered around Mario; 14 binary features describe each cell (types of ground, enemies, blocks and other special items); a history of those features over the last 4 images is used, in addition to other features describing the last 6 actions and the state of Mario (small, big, fire, touches ground), for a total of 27152 binary features (very sparse). The $k$-th output binary variable is $\hat{y}_k = I(w_k^T x + b_k > 0)$, where $w_k$, $b_k$ optimizes the SVM objective with regularizer $\lambda$ using stochastic gradient descent (Ratliff et al., 2007; Bottou, 2009).

We compare performance in terms of the average distance travelled by Mario per stage before dying, running out of time or completing the stage, on randomly generated stages of difficulty 1 with a time limit of 60 seconds to complete the stage. The total distance of each stage varies but is around 4200-4300 on average, so performance can vary roughly in [0,4300]. Stages of difficulty 1 are fairly easy for an average human player but contain most types of enemies and gaps, except with fewer enemies and gaps than stages of harder difficulties. We compare performance of DAGGER, SMILe and SEARN⁶ to the supervised approach (Sup). With each approach we collect 5000 data points per iteration (each stage is about 150 data points if run to completion) and run the methods for 20 iterations.

⁶ We use the same cost-to-go approximation as in Daumé III et al. (2009); in this case SMILe and SEARN differ only in how the weights in the mixture are updated at each iteration.

For SMILe we choose parameter $\alpha = 0.1$ (Sm0.1) as in Ross and Bagnell (2010). For DAGGER we obtain results with different choices of the parameter $\beta_i$: 1) $\beta_i = I(i = 1)$ for $I$ the indicator function (D0); 2) $\beta_i = p^{i-1}$ for all values of $p \in \{0.1, 0.2, \dots, 0.9\}$. We report the best results obtained with $p = 0.5$ (D0.5). We also report the results with $p = 0.9$ (D0.9) which shows the slower convergence of using the expert more frequently at later iterations. Similarly for SEARN, we obtain results with all choices of $\alpha$ in $\{0.1, 0.2, \dots, 1\}$. We report the best results obtained with $\alpha = 0.4$ (Se0.4). We also report results with $\alpha = 1.0$ (Se1), which shows the instability of such a pure policy iteration approach. Figure 4 shows 95% confidence intervals on the average distance travelled per stage at each iteration as a function of the total number of training data collected.

Figure 4: Average distance/stage as a function of data.

Again here we observe that with the supervised

approach, performance stagnates as we collect more data 
from the expert demonstrations, as this does not help the 
particular errors the learned controller makes. In particu- 
lar, a reason the supervised approach gets such a low score 
is that under the learned controller, Mario is often stuck at 
some location against an obstacle instead of jumping over 
it. Since the expert always jumps over obstacles at a sig- 
nificant distance away, the controller did not learn how to 
get unstuck in situations where it is right next to an ob- 
stacle. On the other hand, all the other iterative methods 
perform much better as they eventually learn to get unstuck 
in those situations by encountering them at the later iter- 
ations. Again in this experiment, DAGGER outperforms 
SMILe, and also outperforms SEARN for all choice of a 
we considered. When using fa = 0.9 Z_1 , convergence is 
significantly slower could have benefited from more itera- 
tions as performance was still improving at the end of the 
20 iterations. Choosing 0.5 l_1 yields slightly better per- 
formance (3030) then with the indicator function (2980). 
This is potentially due to the large number of data gener- 
ated where mario is stuck at the same location in the early 
iterations when using the indicator; whereas using the ex- 
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pert a small fraction of the time still allows to observe those 
locations but also unstucks mario and makes it collect a 
wider variety of useful data. A video available on YouTube 
( |Ross| |2010b[ l also shows a qualitative comparison of the 
behavior obtained with each method. 
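The two expert-mixing schedules compared in this experiment can be
sketched as follows. The sketch assumes that the parameter (written
beta_i here) is the probability of executing the expert's action at
iteration i, with the current learned policy acting otherwise; the
function names and the per-step sampling in `choose_action` are
illustrative assumptions rather than the authors' implementation.

```python
import random

def beta_indicator(i):
    """beta_i = I(i = 1): the expert acts only at the first iteration (D0)."""
    return 1.0 if i == 1 else 0.0

def beta_geometric(i, p):
    """beta_i = p^(i-1): expert use decays geometrically (e.g. D0.5, D0.9)."""
    return p ** (i - 1)

def choose_action(beta, expert_action, learner_action, rng=random):
    """Assumed mixing rule: follow the expert with probability beta."""
    return expert_action if rng.random() < beta else learner_action

# Compare the schedules at a few iterations of a 20-iteration run.
for i in (1, 2, 5, 10, 20):
    print(i, beta_indicator(i),
          round(beta_geometric(i, 0.5), 6),
          round(beta_geometric(i, 0.9), 6))
```

Under this reading, p = 0.9 still hands control to the expert about
13.5% of the time at iteration 20 (0.9^19 is roughly 0.135), while
p = 0.5 drops below 1% by iteration 8, which is consistent with the
slower convergence reported for D0.9.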

5.3 Handwriting Recognition 

Finally, we demonstrate the efficacy of our approach on a 
structured prediction problem involving recognizing
handwritten words given the sequence of images of each
character in the word. We follow Daumé III et al. (2009) in
adopting a view of structured prediction as a degenerate form
of imitation learning where the system dynamics are
deterministic and trivial in simply passing on earlier
predictions made as inputs for future predictions. We use the
dataset of Taskar et al. (2003), which has been used
extensively in
the literature to compare several structured prediction ap- 
proaches. This dataset contains roughly 6600 words (for 
a total of over 52000 characters) partitioned in 10 folds. 
We consider the large dataset experiment which consists of 
training on 9 folds and testing on 1 fold and repeating this 
over all folds. Performance is measured in terms of the 
character accuracy on the test folds. 
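A minimal sketch of this evaluation protocol is given below. It assumes
that each fold is a list of words and that accuracy is pooled over all
characters of the held-out fold; the helper names are hypothetical, and
whether the reported figures pool characters or average per fold is not
stated in the text.

```python
def character_accuracy(predicted_words, true_words):
    """Fraction of characters predicted correctly, pooled over all words.

    Assumes each predicted word has the same length as its true word.
    """
    correct = total = 0
    for pred, true in zip(predicted_words, true_words):
        correct += sum(p == t for p, t in zip(pred, true))
        total += len(true)
    return correct / total

def train_test_splits(folds):
    """Yield (train, test): train on all folds but one, test on the rest."""
    for k, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != k for w in fold]
        yield train, test

# Toy usage with 3 folds of one word each (the experiment uses 10 folds).
toy_folds = [["cat"], ["dog"], ["bird"]]
for train, test in train_test_splits(toy_folds):
    print(len(train), "training words,", len(test), "test words")
```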

We consider predicting the word by predicting each character
in sequence in a left-to-right order, using the previously
predicted character to help predict the next and a linear
SVM,^7 following the greedy SEARN approach in Daumé III et
al. (2009). Here we compare our method to SMILe, as well as
SEARN (using the same approximations used in Daumé III et al.
(2009)). We also compare these approaches to two baselines:
a non-structured approach which simply predicts each
character independently, and the supervised training approach
where training is conducted with the previous character
always correctly labeled. Again we try all choices of
$\alpha \in \{0.1, 0.2, \ldots, 1\}$ for SEARN, and report
results for $\alpha = 0.1$, $\alpha = 1$ (pure policy
iteration) and the best $\alpha = 0.8$, and run all
approaches for 20 iterations. Figure 5 shows the performance
of each approach on the test folds after each iteration as a
function of training data. The baseline result without
structure achieves 82% character accuracy by just using an
SVM that predicts each character independently. When adding
the previous character feature, but always training with the
previous character correctly labeled (supervised approach),
performance increases up to 83.6%. Using DAgger increases
performance further to 85.5%. Surprisingly, we observe that
SEARN with $\alpha = 1$, which is a pure policy iteration
approach, performs very well on this experiment, similarly to
the best $\alpha = 0.8$ and DAgger. Because only a small
part of the input is influenced by the current policy (the
previous predicted character feature), this approach is not
as unstable as in general reinforcement/imitation learning
problems (as we saw in the previous experiment). SEARN and
SMILe with small $\alpha = 0.1$ perform similarly but
significantly worse than DAgger. Note that we chose the
simplest (greedy, one-pass) decoding to illustrate the
benefits of the DAgger approach with respect to existing
reductions. Similar techniques can be applied to multi-pass
or beam-search decoding, leading to results that are
competitive with the state-of-the-art.

Figure 5: Character accuracy as a function of iteration.

7 Each character is 8x16 binary pixels (128 input features);
26 binary features are used to encode the previously
predicted letter in the word. We train the multiclass SVM
using the all-pairs reduction to binary classification
(Beygelzimer et al., 2005).
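The greedy, one-pass decoding used in this experiment can be sketched as
follows, assuming 128 pixel features per character plus a 26-dimensional
one-hot encoding of the previously predicted letter. The `score`
callable is a hypothetical stand-in for the trained multiclass
classifier, and encoding the first character's missing predecessor as
all zeros is an assumption not stated in the text.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters
N_PIXELS = 128                     # 8x16 binary pixels per character

def features(pixels, prev_letter):
    """128 pixel features followed by a one-hot of the previous prediction."""
    one_hot = [0] * len(ALPHABET)
    if prev_letter is not None:    # assumed: all zeros for the first character
        one_hot[ALPHABET.index(prev_letter)] = 1
    return list(pixels) + one_hot

def greedy_decode(word_pixels, score):
    """Predict left to right; each prediction feeds the next step's input.

    score(x, letter) -> float is a placeholder for the learned classifier.
    """
    prev, out = None, []
    for pixels in word_pixels:
        x = features(pixels, prev)
        prev = max(ALPHABET, key=lambda c: score(x, c))
        out.append(prev)
    return "".join(out)
```

Since each prediction is fed back as an input, an early mistake changes
the features seen at later positions, which is the only way the current
policy influences its own input distribution in this task.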

6 FUTURE WORK 

We show that by batching over iterations of interaction
with a system, no-regret methods, including the presented
DAgger approach, can provide a learning reduction with
strong performance guarantees in both imitation learning
and structured prediction. In future work, we will consider
more sophisticated strategies than simple greedy forward
decoding for structured prediction, as well as using base
classifiers that rely on Inverse Optimal Control (Abbeel and
Ng, 2004; Ratliff et al., 2006) techniques to learn a cost
function for a planner to aid prediction in imitation
learning. Further, we believe techniques similar to those
presented, by leveraging a cost-to-go estimate, may provide
an understanding of the success of online methods for
reinforcement learning and suggest a similar data-aggregation
method that can guarantee performance in such settings.